Load Data

First, install (if needed) and load the libraries we’ll use for the analysis.

Now let’s load the data. I’ve subset the data to the 1170 participants who have both exposome and proteome measures.

load("/cloud/project/Labs/Clustering/helix.RData")

Checking the dimensions shows that each data set covers the same 1170 participants.

dim(covars)
## [1] 1170   14
dim(phenos)
## [1] 1170    7
dim(expsms)
## [1] 1170  223
dim(prtm)
## [1]   36 1170

Note that the proteome has proteins as rows and participant values as columns. This is a standard format for omic data, and some omic-specific packages analyze the data in this orientation; an example is the limma package for fitting linear models to gene expression data.
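For intuition, here is a toy sketch (not the real prtm matrix) of that orientation. Most generic R functions, such as prcomp, kmeans, and dist, expect observations in rows, so a features-by-samples matrix needs to be transposed before we hand it to them:

```r
# Toy omic matrix: features (proteins) in rows, subjects in columns
prot <- matrix(rnorm(3 * 5), nrow = 3,
               dimnames = list(paste0("protein", 1:3),
                               paste0("subj", 1:5)))
dim(prot)    # 3 proteins x 5 subjects

# Transpose so that rows are observations (subjects)
prot_t <- t(prot)
dim(prot_t)  # 5 subjects x 3 proteins
```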

Covariates

Let’s visualize the distribution of child age.

ggplot(covars, aes(x=hs_child_age_None)) + 
 geom_histogram(aes(y=after_stat(density)), colour="black", fill="white")+
 geom_density(alpha=.2, fill="#FF6666") +
 xlab("Child Age")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Children were recruited in different cohorts. Let’s also see how child’s age varies by cohort.

with(covars,boxplot(hs_child_age_None ~ h_cohort))

Let’s investigate the correlations between children’s phenotypes & covariates. First, a look at the covariate values.

str(covars)
## 'data.frame':    1170 obs. of  14 variables:
##  $ ID               : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ h_cohort         : Factor w/ 6 levels "1","2","3","4",..: 4 4 4 2 3 1 2 1 2 1 ...
##  $ e3_sex_None      : Factor w/ 2 levels "female","male": 2 2 2 1 2 1 2 1 1 1 ...
##  $ e3_yearbir_None  : Factor w/ 7 levels "2003","2004",..: 6 5 6 3 3 5 1 6 2 6 ...
##  $ h_mbmi_None      : num  25.5 26.5 30.1 21 22.2 ...
##  $ hs_wgtgain_None  : num  17 18 11 21 20 30 20 20 12 9 ...
##  $ e3_gac_None      : num  41 41 39 39.3 43 ...
##  $ h_age_None       : num  28 22.8 34.2 32.7 20.9 ...
##  $ h_edumc_None     : Factor w/ 3 levels "1","2","3": 2 3 3 1 1 3 1 2 2 1 ...
##  $ h_native_None    : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 1 3 2 ...
##  $ h_parity_None    : Factor w/ 3 levels "0","1","2": 1 2 2 2 1 1 3 1 2 3 ...
##  $ hs_child_age_None: num  6.17 6.99 6.11 10.14 9.45 ...
##  $ hs_c_height_None : num  1.22 1.22 1.28 1.34 1.37 ...
##  $ hs_c_weight_None : num  23.4 27.6 37.5 27.7 34 21.5 29.8 22.5 24.3 30.3 ...

I’m going to remove the first two columns (ID and cohort) and convert the rest to numeric values so we can compute correlations. Some are 3-level ordered categorical variables (i.e., education, native, parity).

n_covars <- map_dfc(covars[,-c(1:2)],as.numeric)
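One caution with this conversion, illustrated on made-up factors: as.numeric() on a factor returns the internal level codes, not the labels. That is what we want for ordered categories like education, but it silently gives wrong values for factors whose labels are themselves numbers (like year of birth):

```r
# Ordered category: level codes are exactly what we want
edu <- factor(c("1", "3", "2", "3"), levels = c("1", "2", "3"))
as.numeric(edu)               # 1 3 2 3 -- level codes

# Factor whose labels are numbers: level codes are NOT the values
yr <- factor(c("2003", "2008", "2004"))
as.numeric(yr)                # 1 3 2 -- level codes, not years
as.numeric(as.character(yr))  # 2003 2008 2004 -- the actual years
```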

Now let’s check the phenotypes.

str(phenos)
## 'data.frame':    1170 obs. of  7 variables:
##  $ ID              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ e3_bw           : int  4100 4158 4110 3270 3950 2900 3350 3580 3000 3780 ...
##  $ hs_asthma       : num  0 0 1 0 0 1 0 1 0 1 ...
##  $ hs_zbmi_who     : num  0.3 0.41 3.33 -0.76 0.98 -0.08 0.04 -0.1 -1.78 1.75 ...
##  $ hs_correct_raven: int  18 25 13 28 19 19 34 16 35 32 ...
##  $ hs_Gen_Tot      : num  84 39 40 54.5 18 ...
##  $ hs_bmi_c_cat    : Factor w/ 4 levels "1","2","3","4": 2 2 4 2 2 2 2 2 2 3 ...

We’ll remove ID, and make the remaining columns numeric in preparation for a pairwise correlation analysis.

n_phenos <- map_dfc(phenos[,-1],as.numeric)
corrplot(cor(n_covars[,-1],n_phenos),
         method = "square",
         cl.cex=0.6,cl.ratio=0.5,cl.align.text = "l")

This shows positive associations of birthweight with gestational age; of body mass index with child’s weight (ages 6-11); and of the intelligence quotient measured by the RAVEN test with child’s age, height, and weight. The intelligence quotient is negatively correlated with year of birth.

Exposome

We have a large number of exposome variables, 222 in total. Let’s use a heatmap to visualize the correlations between exposure variables (columns). Again we need to remove the first column, which is the subject ID. Studying the codebook, we see that many of these variables are dichotomous or tertiles, but I’m leaving them in as ordered categorical variables.

n_exposome <- map_dfc(expsms[,-1],as.numeric)
cor_exposome <- cor(n_exposome)

The ComplexHeatmap package gives us a nice function to visualize heatmaps, while clustering the rows and columns. The default clustering method computes Euclidean distance between samples with complete linkage for combining clusters.

htc <- ComplexHeatmap::Heatmap(cor_exposome,
                               name = "Cor",
                show_column_names = FALSE,
              show_row_names = FALSE)
ComplexHeatmap::draw(htc)
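For intuition, the same default clustering can be reproduced in base R on toy data: pairwise Euclidean distances between rows, combined by complete-linkage agglomeration with hclust():

```r
# Toy data clustered with the same defaults ComplexHeatmap uses
set.seed(1)
m  <- matrix(rnorm(20), nrow = 5)
d  <- dist(m, method = "euclidean")   # pairwise Euclidean distances
hc <- hclust(d, method = "complete")  # complete-linkage agglomeration
hc$order                              # row ordering for the dendrogram
```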

These variables are a combination of exposures during the pregnancy time period and exposures during the postnatal period. Let’s break it down into two matrices and study these time periods separately. We’ll do this by finding the variable names for the two different time periods from the codebook.

dim(n_exposome)
## [1] 1170  222
preg_expsm <- 
    codebook %>%
        filter(period=="Pregnancy" & domain !="Covariates") %>%
        filter(domain !="Phenotype") %>%
        dplyr::select(variable_name)
preg_expsm <- as.character(unlist((preg_expsm)))

post_expsm <- 
  codebook %>%
        filter(period=="Postnatal" & domain !="Covariates") %>%
        filter(domain !="Phenotype") %>%
        dplyr::select(variable_name)
post_expsm <- as.character(unlist((post_expsm)))

Next up: apply dimension reduction techniques to see if the subjects cluster based on their profile of exposures in a lower dimensional space.

Prenatal Exposome (PCA, k-means)

Scaling of the data is important for PCA. A PCA of the covariance matrix is different from a PCA of the correlation matrix (i.e., of standardized variables). We standardize all variables (features) so that highly variable features don’t contribute more to the summaries.

We select the columns of interest from the exposome matrix, and scale them like this:

pregX  <- scale(n_exposome[,preg_expsm])
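As a sanity check on toy data: standardizing first and then running prcomp() is equivalent to prcomp(x, scale. = TRUE), and the resulting PC variances are the eigenvalues of the correlation matrix:

```r
# Toy data with very different scales, so covariance PCA is dominated by 'b'
set.seed(2)
x <- cbind(a = rnorm(100, sd = 1), b = rnorm(100, sd = 50))

pc_cov <- prcomp(x)                 # covariance PCA
pc_cor <- prcomp(x, scale. = TRUE)  # correlation PCA (standardized variables)

# Standardizing first is equivalent to scale. = TRUE ...
all.equal(prcomp(scale(x))$sdev, pc_cor$sdev)   # TRUE
# ... and the PC variances are the eigenvalues of the correlation matrix
all.equal(pc_cor$sdev^2, eigen(cor(x))$values)  # TRUE
```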

First let’s conduct a principal components analysis. I’m going to cluster the samples using k-means, and color them in the PC scatterplot by their cluster assignment.

How many clusters should we fit?

set.seed(12)
factoextra::fviz_nbclust(pregX, kmeans, method = "gap_stat",nboot=40)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

There were a few issues with non-convergence, but this plot says we should select 6 clusters.

How does this result compare to using the silhouette width? Silhouette width is a direct calculation that doesn’t encounter the convergence warning.

fviz_nbclust(pregX, kmeans, method = "silhouette")

They both pick 6 clusters.
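For intuition, here is the silhouette calculation itself on simulated data, using the recommended cluster package that ships with R. Each point’s silhouette width compares its distance to its own cluster against its distance to the nearest other cluster:

```r
library(cluster)  # recommended package shipped with R

# Two simulated, well-separated groups
set.seed(3)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 4), ncol = 2))
km  <- kmeans(toy, centers = 2, nstart = 20)

# Per-point silhouette widths from the cluster labels and distances
sil <- silhouette(km$cluster, dist(toy))
mean(sil[, "sil_width"])  # average width; values near 1 = well separated
```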

Let’s perform PCA and visualize the results with a scatterplot of PC1 vs. PC2. We’ll also run a k-means cluster analysis with 6 clusters and color the points in the scatterplot by cluster assignment.

my.pca <- prcomp(pregX,retx=TRUE)
dfx_pregX <- as.data.frame(x = my.pca$x)

set.seed(46)
km6_pregX <- stats::kmeans(pregX,centers=6,nstart = 200)

ggplot(dfx_pregX,  aes(x=PC1, y=PC2, 
                       color =factor(km6_pregX$cluster))) + 
        geom_point(size=2.5) +
       labs(color="Cluster")
PCA of 1170 children using exposome variables from pregnancy.

How much of the variation in PC1 does cluster assignment explain?

fit <- lm(dfx_pregX$PC1 ~ factor(km6_pregX$cluster))
summary(fit)$r.squared
## [1] 0.8184068

Cluster explains a lot of the variability of the first PC (\(R^2 = 82\%\)).
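This works because the R² from regressing a variable on cluster labels is exactly the between-cluster share of its total sum of squares. A toy check:

```r
# Two simulated groups with different means
set.seed(4)
y  <- c(rnorm(50, mean = 0), rnorm(50, mean = 3))
cl <- rep(1:2, each = 50)

r2 <- summary(lm(y ~ factor(cl)))$r.squared
ss_within <- sum(tapply(y, cl, function(g) sum((g - mean(g))^2)))
ss_total  <- sum((y - mean(y))^2)
all.equal(r2, 1 - ss_within / ss_total)  # TRUE
```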

The organizers warned us that the exposome captures cohort. Let’s compare our cluster assignment to cohort.

table(km6_pregX$cluster,covars$h_cohort)
##    
##       1   2   3   4   5   6
##   1   0   0   0   1   0 190
##   2   0   0   0   0 218   0
##   3 192   4   7   4   4   1
##   4   0   0 204   0   0   0
##   5   1 146   0   0   1   0
##   6   0   0   0 197   0   0

Yup. We ‘found’ cohort! The cluster labels don’t line up with the cohort labels, but after matching each cluster to its dominant cohort, only 23 samples are assigned differently.

How much variation in the exposome do we explain with the first few PCs?

totvar <- sum(my.pca$sdev^2)
plot(1:80,cumsum(my.pca$sdev[1:80]^2)/totvar,ylim=c(0,1),
     type='l',xlab="PC",ylab="Cumulative Proportion of Total Var")

The first 10 PCs explain about 40% of the variation in the pregnancy exposome.
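A shortcut worth knowing: summary() on a prcomp object reports these same cumulative proportions directly. A toy check on simulated data (the summary rounds to 5 digits, hence the tolerance):

```r
# Simulated data: compare manual cumulative variance to summary()
set.seed(5)
z  <- matrix(rnorm(200 * 8), ncol = 8)
pz <- prcomp(z, scale. = TRUE)

cum_manual  <- cumsum(pz$sdev^2) / sum(pz$sdev^2)
cum_summary <- summary(pz)$importance["Cumulative Proportion", ]
all.equal(unname(cum_manual), unname(cum_summary), tolerance = 1e-4)  # TRUE
```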

Postnatal Exposome (PCA, k-means)

Now let’s repeat this analysis using the postnatal exposome. First, we scale (normalize) the variables (columns).

postnX <- scale(n_exposome[,post_expsm])

How many clusters should we fit?

fviz_nbclust(postnX, kmeans, method = "gap_stat",nboot=40)
## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

## Warning: did not converge in 10 iterations

Again we had some non-convergence issues. What does silhouette width give us?

fviz_nbclust(postnX, kmeans, method = "silhouette")

Both give us 6 clusters, so that’s what we’ll use.

my.pca <- prcomp(postnX,retx=TRUE)
dfx_postnX <- as.data.frame(x = my.pca$x)

set.seed(42)
km6_postnX <- stats::kmeans(postnX,centers=6,nstart = 200)

ggplot(dfx_postnX,  aes(x=PC1, y=PC2, 
          color = factor(km6_postnX$cluster))) + 
          geom_point(size=2.5) +
       labs(color="Cluster")
PCA of 1170 children using exposome variables from postnatal time.

Again, it looks like we explain a lot of the variation of PC1 with cluster assignment. What is the \(R^2\)?

fit <- lm(dfx_postnX$PC1 ~ factor(km6_postnX$cluster))
summary(fit)$r.squared
## [1] 0.795881

How closely do the cluster assignments from the prenatal and postnatal exposomes agree?

table(km6_pregX$cluster,km6_postnX$cluster)
##    
##       1   2   3   4   5   6
##   1   0   0   0   1 190   0
##   2   0   0   0   0   0 218
##   3   2   5 194   5   1   5
##   4   0 191   4   9   0   0
##   5   0   2   2 143   0   1
##   6 196   0   1   0   0   0

Very closely!
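If you want a single number for agreement that doesn’t depend on matching labels, the Rand index is the proportion of sample pairs that both clusterings treat the same way (together in both, or apart in both). A base-R sketch with a hypothetical helper function:

```r
# Rand index: share of pairs on which two clusterings agree
rand_index <- function(a, b) {
  n  <- length(a)
  lt <- lower.tri(matrix(0, n, n))    # each pair i < j once
  same_a <- outer(a, a, "==")[lt]     # pair together in clustering a?
  same_b <- outer(b, b, "==")[lt]     # pair together in clustering b?
  mean(same_a == same_b)
}

rand_index(c(1, 1, 2, 2), c(2, 2, 1, 1))  # 1: identical up to relabeling
rand_index(1:4, rep(1, 4))                # 0: complete disagreement
```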

Dynamic scatterplot

With the plotly package you can explore the cluster structure of the output interactively, with the ability to rescale and rotate the figure.

plot_ly(x=dfx_postnX$PC1, y=dfx_postnX$PC2, z=dfx_postnX$PC3, type="scatter3d", mode = "markers", color=factor(km6_postnX$cluster))

Interpreting PCs

We can use correlations to look for the exposures that correlate with the PCs within the two time periods (pregnancy and postnatal).

cormat <- cor(pregX,dfx_pregX[,1:10])
dim(cormat)
## [1] 88 10
hnr <- nrow(cormat)/2
corrplot(cormat[1:hnr,],is.corr=FALSE,tl.cex=0.5,
         cl.cex=0.6,cl.ratio=0.5,cl.align.text = "l")

corrplot(cormat[-c(1:hnr),],is.corr=FALSE,tl.cex=0.5,
         cl.cex=0.6,cl.ratio=0.5,cl.align.text = "l")

Prenatal PCBs are the exposures that are most correlated with PC1.

Now let’s look at exposures that explain the most variation in the postnatal period.

cormat <- cor(postnX,dfx_postnX[,1:10])
dim(cormat)
## [1] 134  10
hnr <- nrow(cormat)/2
corrplot(cormat[1:hnr,],is.corr=FALSE,tl.cex=0.5,
         cl.cex=0.6,cl.ratio=0.5,cl.align.text = "l")

corrplot(cormat[-c(1:hnr),],is.corr=FALSE,tl.cex=0.5,
         cl.cex=0.6,cl.ratio=0.5,cl.align.text = "l")

CCA

Now let’s try a canonical correlation analysis to look for summary variables that correlate between the prenatal and postnatal periods. I’m not sure an exposome expert would try this, but my thinking is that the exposomes from both time points are informative of the same cluster structure, so there must be exposome summary variables at each time point that correlate with each other.

Just as for PCA, we want to work with the matrices after standardizing the variables (features).

cc.out <- cc(pregX,postnX)
names(cc.out)
## [1] "cor"    "names"  "xcoef"  "ycoef"  "scores"
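For intuition, base R’s cancor() performs the same core computation. A sketch on simulated data where the two matrices share a single latent signal (all values here are made up):

```r
# Simulated X and Y sharing one latent signal
set.seed(6)
n <- 200
latent <- rnorm(n)
X <- cbind(latent + rnorm(n), rnorm(n))
Y <- cbind(latent + rnorm(n), rnorm(n))

# Canonical correlations of the standardized matrices
cc_toy <- cancor(scale(X), scale(Y))
round(cc_toy$cor, 2)  # first canonical correlation large, second near 0
```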

Let’s plot the canonical correlation estimates.

plot(cc.out$cor,type="b")

The top 5 pairs of canonical variables are very highly correlated. We can access the scores directly from the output as well to look at their pairwise correlation.

names(cc.out$scores)
## [1] "xscores"        "yscores"        "corr.X.xscores" "corr.Y.xscores"
## [5] "corr.X.yscores" "corr.Y.yscores"

Let’s take a look at the first pair of canonical variables:

ccpr1 <- cbind.data.frame(cxs1 = cc.out$scores$xscores[,1],
                          cys1 = cc.out$scores$yscores[,1])
ggplot(ccpr1, aes(x=cxs1, y=cys1, 
          color = factor(km6_postnX$cluster))) + 
          geom_point(size=2.5) + 
          labs(color="Cluster")

These linear combinations do a good job of separating clusters 1, 5 and 6 from an overlapping group of clusters (2,3,4).

How about the next pair of canonical variables?

ccpr2 <- cbind.data.frame(cxs2 = cc.out$scores$xscores[,2],
                          cys2 = cc.out$scores$yscores[,2])
ggplot(ccpr2, aes(x=cxs2, y=cys2, 
          color = factor(km6_postnX$cluster))) + 
          geom_point(size=2.5) + 
          labs(color="Cluster")

Now we can separate group 2 from the others.

In low-dimensional settings (n > p) the canonical variables are uncorrelated, just like the PCs. Let’s see how the PCs and canonical variables correlate with one another using the pregnancy exposome.

svmat <- cbind.data.frame(dfx_pregX[,1:3],
               cc.out$scores$xscores[,1:3])
corrplot::corrplot(cor(svmat),is.corr=FALSE)

PC1 is most correlated with the third canonical variable. In other words, when analyzing the pregnancy exposome, canonical correlation analysis prioritized directions other than the direction of maximal variance (PC1), which only shows up in the third canonical variable.

Let’s jump to the next file and look at exposome & proteome together in a CCA.